Abstract: Clustering plays a vital role in many areas 
of research, such as data mining, image retrieval and 
bio-computing. The distance measure plays an important 
role in clustering data points, and choosing the right 
distance measure for a given dataset is a major 
challenge. In this paper, we study various distance 
measures and their effect on clustering. We survey 
existing distance measures for clustering and present a 
comparison between them based on application domain, 
efficiency, benefits and drawbacks. This comparison 
helps researchers decide quickly which distance measure 
to use for clustering. We conclude by identifying 
trends and challenges of research and development in 
clustering. 
I. INTRODUCTION 

Clustering is an important data mining technique 
that has a wide range of applications in areas such as 
biology, medicine, market research and image analysis. 
It is the process of partitioning a set of objects into 
subsets such that the objects in each subset are 
similar to each other. In cluster analysis, the 
distance measure and the clustering algorithm both play 
an important role [1]. 

An important step in any clustering is to select a 
distance measure, which determines how the similarity 
[1] of two elements is calculated. This choice 
influences the shape of the clusters, as some elements 
may be close to one another according to one distance 
and farther apart according to another. 
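
As a small illustration of this point, the following 
Python sketch compares the Euclidean and Manhattan 
distances (both described in Section II) on three 
points. The points and function names are hypothetical 
and were chosen only for this example. 

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical points, chosen only to illustrate the effect.
origin, a, b = (0, 0), (3, 3), (0, 5)

# Under the Euclidean distance, a is closer to the origin than b ...
print(euclidean(origin, a) < euclidean(origin, b))  # True (about 4.24 vs 5.0)
# ... but under the Manhattan distance, b is closer than a.
print(manhattan(origin, a) > manhattan(origin, b))  # True (6 vs 5)
```

The nearest neighbour of the origin therefore depends 
on the measure, which is why the same data can yield 
differently shaped clusters. 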

It is expected that the distance between objects 
within a cluster should be minimal and the distance 
between objects in different clusters should be 
maximal. In this paper we compare different distance 
measures. The comparison shows that different distance 
measures behave differently depending on the 
application domain. The rest of the paper is organized 
as follows: 

In Section II, we discuss distance measures and their 
significance in a nutshell; TABLE I presents the 
comparison between these distance measures; in Section 
III, we describe how the accuracy can be measured and 
interpret the comparison; and in Section IV we conclude 
the report. 

II. DISTANCE MEASURES AND THEIR 
SIGNIFICANCE 

A cluster is a collection of data objects that are 
similar to the other objects within the same cluster 
and dissimilar to those in other clusters. Similarity 
between two objects is calculated using a distance 
measure [6]. Since clustering forms groups, it can be 
used as a pre-processing step for methods such as 
classification. 

Many distance measures have been proposed in the 
literature for data clustering. Most often, these 
measures are metric functions, such as Manhattan 
distance, Minkowski distance and Hamming distance. The 
Jaccard index, cosine similarity and Dice coefficient 
are also popular measures. For non-numeric datasets, 
special distance functions have been proposed. For 
example, edit distance is a well-known distance measure 
for text attributes. 

In this section we briefly describe seven commonly 
used distance measures. 

A. Euclidean Distance 

The Euclidean distance or Euclidean metric is the 
ordinary distance between two points that one would 
measure with a ruler. It is the straight line distance 
between two points. 

In a plane with $p_1$ at $(x_1, y_1)$ and $p_2$ at 
$(x_2, y_2)$, it is 
$\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$. 

In N dimensions, the Euclidean distance between two 
points p and q is $\sqrt{\sum_{i=1}^{N} (p_i - q_i)^2}$, 
where $p_i$ (or $q_i$) is the coordinate of p (or q) in 
dimension i. 
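
The N-dimensional formula above can be sketched in 
Python as follows. This is a minimal illustration, not 
an implementation taken from any clustering tool; the 
function name is an assumption made for this example. 

```python
import math

def euclidean_distance(p, q):
    """Square root of the summed squared coordinate differences of p and q."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Two points in the plane: sqrt(3^2 + 4^2)
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```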

B. Manhattan Distance 

The Manhattan distance is the distance between two 
points measured along axes at right angles. In a plane 
with $p_1$ at $(x_1, y_1)$ and $p_2$ at $(x_2, y_2)$, 
it is $|x_1 - x_2| + |y_1 - y_2|$. 

This is easily generalized to higher dimensions. 
Manhattan distance is often used in integrated circuits 
where wires only run parallel to the X or Y axis. It is 
also known as rectilinear distance, Minkowski's [7] [3] 
$L_1$ distance, taxicab metric, or city block distance. 
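
The generalization to higher dimensions sums the 
absolute difference in every coordinate. A minimal 
Python sketch, with a function name assumed for this 
example: 

```python
def manhattan_distance(p, q):
    """Sum of absolute coordinate differences (L1 distance) of p and q."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# Same two points as a 3-4-5 triangle: |0 - 3| + |0 - 4|
print(manhattan_distance((0, 0), (3, 4)))  # 7
```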
C. Bit-Vector Distance 

An N x N matrix $M_b$ is calculated. Each point has d 
dimensions, and $M_b(p_i, p_j)$ is determined as a 
d-bit vector. This vector is obtained as follows: 

If the numerical value of the xth dimension of point 
$p_i$ is greater than the numerical value of the xth 
dimension of point $p_j$, then bit x of 
$M_b(p_i, p_j)$ is set to 1 and bit x of 
$M_b(p_j, p_i)$ is set to 0. All the bit vectors in 
$M_b$ are then converted to integers. 
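
The construction can be sketched in Python as below. 
The text does not specify the bit order or how ties are 
handled, so this sketch assumes that the first 
dimension is the most significant bit and that equal 
values give a 0 bit in both directions. The function 
name is hypothetical. 

```python
def bit_vector_matrix(points):
    """Build the N x N matrix of d-bit comparison vectors, stored as integers.

    Assumptions (not stated in the text): dimension 0 is the most
    significant bit, and equal coordinates yield 0 in both directions.
    """
    n = len(points)
    d = len(points[0])
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            value = 0
            for x in range(d):
                # Append a 1 bit when point i exceeds point j in dimension x.
                bit = 1 if points[i][x] > points[j][x] else 0
                value = (value << 1) | bit
            matrix[i][j] = value
    return matrix

# Two 2-dimensional points: (1, 5) vs (2, 3) gives bits 01, the reverse gives 10.
print(bit_vector_matrix([(1, 5), (2, 3)]))  # [[0, 1], [2, 0]]
```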

D. Hamming Distance 

The Hamming distance between two strings of 
equal length is the number of positions for which the 
corresponding symbols are different. 

Let $x, y \in A^n$. We define the Hamming distance 
between x and y, denoted $d_H(x, y)$, to be the number 
of places where x and y are different. 

The Hamming distance [1] [6] can be interpreted as 
the number of bits which need to be changed 
(corrupted) to turn one string into the other. 
Sometimes the number of characters is used instead of 
the number of bits. Hamming distance can be seen as the 
Manhattan distance between bit vectors. 
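
A minimal Python sketch of this definition follows; it 
works on any two equal-length sequences, so it covers 
both the character and the bit interpretation. The 
function name is an assumption made for this example. 

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(x, y))

# Character strings: positions 3, 4 and 5 differ.
print(hamming_distance("karolin", "kathrin"))  # 3

# Bit strings: positions 3 and 5 differ.
print(hamming_distance("1011101", "1001001"))  # 2
```

For bit vectors the result equals the Manhattan 
distance, since each differing position contributes an 
absolute difference of exactly 1. 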

E. Jaccard Index 

The Jaccard index, also known as the Jaccard 
similarity coefficient, is a statistic used for 
comparing the similarity and diversity of sample sets. 

The Jaccard coefficient [11] measures similarity 
between sample sets, and is defined as the size of the 
intersection divided by the size of the union of the 
sample sets: 

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$ 
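
The definition translates directly into set operations. 
The Python sketch below is illustrative only; the 
function name and the handling of two empty sets 
(treated here as identical) are assumptions, since the 
formula is undefined when the union is empty. 

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)

# Intersection {b, c} has 2 elements, union {a, b, c, d} has 4.
print(jaccard_index({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```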

F. Cosine Index 

It is a measure of similarity between two vectors of 
n dimensions obtained by finding the angle between 
them, and is often used to compare documents in text 
mining. Given two vectors of attributes, A and B, the 
cosine similarity [11], $\theta$, is represented using 
a dot product and magnitude as 

$$\theta = \arccos\left(\frac{A \cdot B}{\|A\| \, \|B\|}\right)$$ 

For text matching, the attribute vectors A and B are 
usually the tf-idf vectors of the documents. Since the 
angle $\theta$ is in the range $[0, \pi]$, a value of 
$\pi$ means exactly opposite, $\pi/2$ means 
independent, and 0 means exactly the same, with 
in-between values indicating intermediate similarities 
or dissimilarities. 
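
The angle can be computed as in the following Python 
sketch. The function name is an assumption, and the 
clamping step is an addition of this sketch to guard 
against floating-point rounding pushing the cosine 
slightly outside [-1, 1]. 

```python
import math

def cosine_angle(a, b):
    """Angle in radians, in [0, pi], between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("the angle is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding errors cannot break acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos_theta)

print(cosine_angle((1, 0), (1, 0)))   # 0.0, exactly the same direction
print(cosine_angle((1, 0), (0, 1)))   # about 1.5708 (pi / 2), independent
print(cosine_angle((1, 0), (-1, 0)))  # about 3.1416 (pi), exactly opposite
```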



G. Dice Index 

Dice's coefficient [11], also known as the Dice index, 
is a similarity measure related to the Jaccard index. 

For sets X and Y of keywords used in information 
retrieval, the coefficient may be defined as: 

$$s = \frac{2 |X \cap Y|}{|X| + |Y|}$$ 

When taken as a string similarity measure, the 
coefficient may be calculated for two strings, x and y, 
using bigrams as follows: 

$$s = \frac{2 n_t}{n_x + n_y}$$ 

where $n_t$ is the number of character bigrams found in 
both strings, $n_x$ is the number of bigrams in string 
x and $n_y$ is the number of bigrams in string y. 
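
Both forms of the coefficient can be sketched in Python 
as below. The function names are hypothetical. The 
sketch assumes that repeated bigrams within a string 
are counted once (set semantics) and that two empty 
sets are treated as identical; the text does not state 
either choice. 

```python
def dice_coefficient(x, y):
    """Twice the intersection size divided by the sum of the set sizes."""
    x, y = set(x), set(y)
    total = len(x) + len(y)
    if total == 0:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return 2 * len(x & y) / total

def bigrams(s):
    """Adjacent character pairs of a string, e.g. 'night' -> ni, ig, gh, ht."""
    return [s[i:i + 2] for i in range(len(s) - 1)]

def dice_string_similarity(x, y):
    """Dice coefficient over the character bigrams of two strings."""
    # Assumption: repeated bigrams are counted once (set semantics).
    return dice_coefficient(bigrams(x), bigrams(y))

# 'night' and 'nacht' each have 4 bigrams and share only 'ht': 2 * 1 / (4 + 4)
print(dice_string_similarity("night", "nacht"))  # 0.25
```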

III. ACCURACY AND RESULT INTERPRETATION 

In general, the larger the number of sub-clusters 
produced by the clustering, the more accurate the final 
result is. However, too many sub-clusters will slow 
down the clustering. The comparison in TABLE I covers 
5 proximity measures. It is based on 4 different 
criteria that are generally required to decide upon a 
distance measure and a clustering algorithm. 

All the above comparisons were tested using a standard 
synthetic dataset generated by the Syndeca [3] 
software, and a few of them were tested using the 
open-source clustering tool CLUTO. 

IV. CONCLUSION 

This paper surveys existing proximity measures for 
clustering and presents a comparison between them 
based on application domain, efficiency, benefits and 
drawbacks. This comparison helps researchers decide 
quickly which distance measure to use for clustering. 
We ran our experiments on synthetic datasets for 
validation. Future work involves running the 
experiments on larger and different kinds of datasets 
and extending our study to other proximity measures and 
clustering algorithms. 

TABLE I. COMPARISON OF DISTANCE MEASURES 

| Distance Measure | Formula | Algorithms in Which It Is Used | Benefits | Drawbacks | Application Area |
| --- | --- | --- | --- | --- | --- |
| Euclidean | $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$ | Partitional algorithms; K-Modes; AutoClass; ROCK | Easy to implement and test | Results are greatly influenced by variables that have the largest value. Does not work well for image data or document classification | Applications involving interval data; health psychology analysis; DNA analysis |
| Manhattan | $\lvert x_1 - x_2 \rvert + \lvert y_1 - y_2 \rvert$ | Partitional algorithms | Easily generalized to higher dimensions | Does not work well for image data and document classification | Integrated circuits |
| Cosine Similarity | $\theta = \arccos\left(\frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}\right)$ | Ontology and graph based | Handles both continuous and categorical variables | Does not work well for nominal data | Text mining |
| Jaccard Index | $J(A, B) = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$ | Neural network | Handles both continuous and categorical variables | Does not work well for nominal data | Document classification |